Improved Statistical Machine Translation Using Paraphrases
Abstract
Parallel corpora are crucial for training SMT systems. However, for many language pairs they are available only in very limited quantities. For these language pairs a huge portion of phrases encountered at run-time will be unknown. We show how techniques from paraphrasing can be used to deal with these otherwise unknown source language phrases. Our results show that augmenting a state-of-the-art SMT system with paraphrases leads to significantly improved coverage and translation quality. For a training corpus with 10,000 sentence pairs we increase the coverage of unique test set unigrams from 48% to 90%, with more than half of the newly covered items accurately translated, as opposed to none in current approaches.
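The coverage idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy phrase table, the toy paraphrase table, and the scoring (summing paraphrase probability times translation probability) are all assumptions made for illustration. The sketch shows how an unknown source word can borrow translations from a known paraphrase, and how unique-unigram coverage changes as a result.

```python
from collections import defaultdict


def unigram_coverage(test_tokens, phrase_table):
    """Fraction of unique test-set unigrams with at least one translation."""
    unique = set(test_tokens)
    if not unique:
        return 0.0
    covered = sum(1 for word in unique if word in phrase_table)
    return covered / len(unique)


def augment_with_paraphrases(phrase_table, paraphrases, vocabulary):
    """Return a copy of phrase_table in which each unknown phrase borrows
    the translations of its known paraphrases.

    Assumption (illustrative only): a borrowed translation is scored as the
    sum over paraphrases of p(paraphrase | phrase) * p(target | paraphrase).
    """
    augmented = {src: dict(tgts) for src, tgts in phrase_table.items()}
    for phrase in vocabulary:
        if phrase in phrase_table:
            continue  # already translatable, nothing to borrow
        borrowed = defaultdict(float)
        for para, p_para in paraphrases.get(phrase, {}).items():
            for tgt, p_trans in phrase_table.get(para, {}).items():
                borrowed[tgt] += p_para * p_trans
        if borrowed:
            augmented[phrase] = dict(borrowed)
    return augmented


# Hypothetical toy data: source phrase -> {target phrase: probability}
phrase_table = {"house": {"casa": 0.9}, "big": {"grande": 0.8}}
# Hypothetical toy data: unknown phrase -> {known paraphrase: probability}
paraphrases = {"home": {"house": 0.5}, "large": {"big": 0.6}}
test_tokens = ["house", "home", "large", "tiny"]

augmented_table = augment_with_paraphrases(
    phrase_table, paraphrases, set(test_tokens)
)
coverage_before = unigram_coverage(test_tokens, phrase_table)    # 0.25
coverage_after = unigram_coverage(test_tokens, augmented_table)  # 0.75
```

In this toy setting "tiny" stays uncovered because it has no paraphrase with a known translation, which mirrors the abstract's point that coverage rises substantially but does not reach 100%.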
Similar resources
Improved Statistical Machine Translation Using Monolingually-Derived Paraphrases
Untranslated words still constitute a major problem for Statistical Machine Translation (SMT), and current SMT systems are limited by the quantity of parallel training texts. Augmenting the training data with paraphrases generated by pivoting through other languages alleviates this problem, especially for the so-called “low density” languages. But pivoting requires additional parallel texts. We...
Paraphrasing with Bilingual Parallel Corpora
Previous work has used monolingual parallel corpora to extract and generate paraphrases. We show that this task can be done using bilingual parallel corpora, a much more commonly available resource. Using alignment techniques from phrase-based statistical machine translation, we show how paraphrases in one language can be identified using a phrase in another language as a pivot. We define a para...
Using Paraphrases for Parameter Tuning in Statistical Machine Translation
Most state-of-the-art statistical machine translation systems use log-linear models, which are defined in terms of hypothesis features and weights for those features. It is standard to tune the feature weights in order to maximize a translation quality metric, using held-out test sentences and their corresponding reference translations. However, obtaining reference translations is expensive. In...
Improved Statistical Machine Translation Using Monolingual Paraphrases
We propose a novel monolingual sentence paraphrasing method for augmenting the training data for statistical machine translation systems “for free” – by creating it from data that is already available rather than having to create more aligned data. Starting with a syntactic tree, we recursively generate new sentence variants where noun compounds are paraphrased using suitable prepositions, and ...
Using Word Embeddings for Improving Statistical Machine Translation of Phrasal Verbs
We examine the employment of word embeddings for machine translation (MT) of phrasal verbs (PVs), a linguistic phenomenon with challenging semantics. Using word embeddings, we augment the translation model with two features: one modelling distributional semantic properties of the source and target phrase and another modelling the degree of compositionality of PVs. We also obtain paraphrases to ...